fix: check instance state on termination failure #2253
@@ -43,6 +43,10 @@ var (

type SpotFallbackError error

type InstanceTerminatedError struct {
Can you just make this an error instead of a struct, like SpotFallbackError?
I actually started with that, but I think the InstanceTerminatedError error ends up being the *errors.errorString type, which means that IsTerminatedError would return true for errors when we expect it to return false. Now that I think of it, I'm not 100% sure the isSpotFallback method is doing what we intend either. Will have to test it further.
Here's an example, btw.
https://go.dev/play/p/cswSjwGQFL2
o wow, good catch!
lgtm
Fixes #2100 and #1796
Description
If tag conditions are used in Karpenter IAM policies for the TerminateInstances action, Karpenter can get stuck in a loop trying to terminate instances that were previously terminated (and reaped) without Karpenter's knowledge. This fix adds a step: if instance termination fails, Karpenter calls DescribeInstances to determine the instance's state. If the state is "terminated" or the instance isn't found, Karpenter skips the TerminateInstances call and proceeds with removing the node in K8S.
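The control flow described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: the ec2Client interface, its TerminateInstance and DescribeState methods, the terminate function, fakeClient, and the instance ID are all hypothetical stand-ins for the real AWS SDK calls.

```go
package main

import (
	"errors"
	"fmt"
)

// ec2Client is a hypothetical, simplified stand-in for the EC2 API.
type ec2Client interface {
	TerminateInstance(id string) error
	// DescribeState reports the instance state and whether it was found.
	DescribeState(id string) (state string, found bool, err error)
}

// terminate tries to terminate the instance. If that fails, it checks the
// instance state and treats "terminated" or "not found" as success, so the
// caller can go on to remove the node.
func terminate(c ec2Client, id string) error {
	termErr := c.TerminateInstance(id)
	if termErr == nil {
		return nil
	}
	state, found, descErr := c.DescribeState(id)
	if descErr != nil {
		return fmt.Errorf("terminating %s: %w (describe failed: %v)", id, termErr, descErr)
	}
	if !found || state == "terminated" {
		return nil // the instance is already gone
	}
	return fmt.Errorf("terminating %s in state %q: %w", id, state, termErr)
}

// fakeClient always fails termination, simulating a denied IAM tag condition.
type fakeClient struct {
	state string
	found bool
}

func (f fakeClient) TerminateInstance(string) error { return errors.New("unauthorized") }

func (f fakeClient) DescribeState(string) (string, bool, error) {
	return f.state, f.found, nil
}

func main() {
	id := "i-0123456789abcdef0" // hypothetical instance ID
	fmt.Println(terminate(fakeClient{state: "terminated", found: true}, id) == nil) // true
	fmt.Println(terminate(fakeClient{found: false}, id) == nil)                     // true
	fmt.Println(terminate(fakeClient{state: "running", found: true}, id) == nil)    // false
}
```

In this sketch the original termination error is still returned when the instance is running, so a genuine permissions problem keeps surfacing instead of being silently swallowed.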
How was this change tested?
Manually reproduced the issue by adding tag conditions to the Karpenter IAM policies for the TerminateInstances action, launching an instance with those tags, and then manually removing the tags from the instance. This simulates the case where a terminated instance no longer has the tags Karpenter requires in order to terminate it.
Next, manually deleted the node in K8S, which caused Karpenter to repeatedly attempt to terminate the EC2 instance; each attempt failed with an UnauthorizedException error (this is intended behavior).
Finally, manually terminated the EC2 instance, leaving it in the "terminated" state. The next time Karpenter tried to terminate the instance, it also called DescribeInstances and saw that the instance was "terminated". Karpenter then skipped the TerminateInstances call and removed the node object.
Does this change impact docs?
Release Note
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.